A random forest is an ensemble method built from many weak learners (decision trees). It combines bootstrap sampling of the training rows (bagging) with random feature subsampling at each split (the random-subspace idea) to reduce the variance and instability of any single tree. Intuitively, it works like a panel of experts casting votes: each expert (tree) sees a slightly different slice of the data and features, and the final prediction is a majority vote (classification) or an average (regression), which tends to be more robust than any individual opinion.
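To make those two ingredients concrete, here is a minimal hand-rolled sketch; the synthetic dataset, the 25-tree ensemble size, and the use of sklearn's `DecisionTreeClassifier` are purely illustrative choices, and the `RandomForestClassifier` used later in this post does all of this internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # bagging: every tree is trained on a bootstrap sample of the rows
    idx = rng.integers(0, len(X), size=len(X))
    # random-subspace idea: each split only considers a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# classification: majority vote across trees (regression would average instead)
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy on the training data:", (ensemble_pred == y).mean())
```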
| Pros | Cons |
|---|---|
| Relatively robust to high-dimensional, noisy data; less prone to overfitting than a single tree | Higher training and inference cost than a single tree; hard to compress into a very small model |
| Needs almost no feature scaling; handles a mix of numerical and categorical features | Whole-model interpretability is limited (feature importances and permutation importance help mitigate this) |
| Built-in OOB evaluation and feature importances | Can still be biased toward the majority class on heavily imbalanced data (tune class_weight or resample) |
| Easy to parallelize; relatively insensitive to hyperparameter choices | A single tree can be visualized, but the ensemble as a whole is hard to interpret at a glance |
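The end-to-end example below uses seaborn's built-in Titanic dataset: preprocessing with a ColumnTransformer, a RandomForestClassifier wrapped in a Pipeline, the usual classification metrics, and two views of feature importance (MDI and permutation importance).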
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, classification_report,
    roc_auc_score, average_precision_score
)
from sklearn.inspection import permutation_importance
# -----------------------------
# 1) Load the data (no external file needed)
# -----------------------------
df = sns.load_dataset("titanic")
# 2) Target and features (seaborn titanic columns)
target = "survived"
num_features = ["age", "fare", "pclass", "sibsp", "parch"]
cat_features = ["sex", "class", "embarked", "who", "adult_male", "alone"]
X = df[num_features + cat_features]
y = df[target].astype(int)
# -----------------------------
# 3) Preprocessing: impute numeric with the median, categorical with the mode, then one-hot encode
# -----------------------------
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)
# -----------------------------
# 4) Random forest model
# -----------------------------
rf = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",  # the Titanic classes are mildly imbalanced
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
pipe = Pipeline(steps=[("preprocess", preprocess),
                       ("model", rf)])
# -----------------------------
# 5) Split the data and train
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipe.fit(X_train, y_train)
# -----------------------------
# 6) Evaluation metrics
# -----------------------------
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]
print("Test Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print(classification_report(y_test, y_pred, digits=4))
print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 4))
print("PR-AUC (Average Precision):", round(average_precision_score(y_test, y_proba), 4))
rf_model = pipe.named_steps["model"]
if hasattr(rf_model, "oob_score_"):
    print("OOB Score:", round(rf_model.oob_score_, 4))
# -----------------------------
# 7) Recover the expanded feature names
#    - kept aligned with the number of features the model actually sees, to avoid length mismatches
# -----------------------------
# category names after one-hot expansion
ohe = pipe.named_steps["preprocess"].named_transformers_["cat"].named_steps["onehot"]
ohe_names = list(ohe.get_feature_names_out(cat_features))
all_feature_names = num_features + ohe_names
# align with the model's actual feature count (defensive)
n_model_features = rf_model.n_features_in_
if len(all_feature_names) != n_model_features:
    # if the number of columns produced by the ColumnTransformer ever disagrees
    # with our reconstruction (it should not), truncate to match
    all_feature_names = all_feature_names[:n_model_features]
# -----------------------------
# 8) Feature importance (MDI)
# -----------------------------
mdi_importance = pd.DataFrame({
    "feature": all_feature_names,
    "importance": rf_model.feature_importances_[:len(all_feature_names)]
}).sort_values("importance", ascending=False).head(20)
print("\nTop MDI Importances:")
print(mdi_importance.to_string(index=False))
# -----------------------------
# 9) Permutation importance (computed in the preprocessed feature space)
#    so that the length of perm.importances_* matches all_feature_names exactly
# -----------------------------
X_test_trans = pipe.named_steps["preprocess"].transform(X_test)  # test set in the model's feature space
estimator = pipe.named_steps["model"]  # run PI directly on the RF over the transformed features
perm = permutation_importance(
    estimator, X_test_trans, y_test,
    n_repeats=10, random_state=42, n_jobs=-1
)
perm_importance = pd.DataFrame({
    "feature": all_feature_names,
    "importance_mean": perm.importances_mean[:len(all_feature_names)],
    "importance_std": perm.importances_std[:len(all_feature_names)],
}).sort_values("importance_mean", ascending=False).head(20)
print("\nTop Permutation Importances (transformed space):")
print(perm_importance.to_string(index=False))
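Running the script prints output along these lines: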
Test Accuracy: 0.8324
precision recall f1-score support
0 0.8509 0.8818 0.8661 110
1 0.8000 0.7536 0.7761 69
accuracy 0.8324 179
macro avg 0.8254 0.8177 0.8211 179
weighted avg 0.8313 0.8324 0.8314 179
ROC-AUC: 0.8515
PR-AUC (Average Precision): 0.8335
OOB Score: 0.8216
Top MDI Importances:
feature importance
fare 0.194484
age 0.145927
adult_male_False 0.115839
adult_male_True 0.080458
sex_female 0.075092
who_man 0.074631
pclass 0.047228
sex_male 0.046897
sibsp 0.037505
class_Third 0.035858
who_woman 0.029911
class_First 0.025534
parch 0.022043
embarked_S 0.017173
embarked_C 0.011033
class_Second 0.010601
alone_False 0.009373
alone_True 0.007693
embarked_Q 0.006361
who_child 0.006358
Top Permutation Importances (transformed space):
feature importance_mean importance_std
fare 0.070950 0.013466
age 0.031285 0.014823
adult_male_True 0.021788 0.008815
adult_male_False 0.021788 0.008815
who_man 0.021788 0.008815
embarked_S 0.012849 0.006634
class_Third 0.006145 0.008076
embarked_C 0.005587 0.002498
sex_male 0.005587 0.004327
who_woman 0.005028 0.005833
sibsp 0.004469 0.007821
sex_female 0.003911 0.005028
who_child 0.003352 0.003706
parch 0.002793 0.005151
embarked_Q 0.001676 0.003577
pclass -0.000559 0.010133
class_Second -0.004469 0.006017
alone_True -0.005028 0.009497
class_First -0.006704 0.008939
alone_False -0.008939 0.008727
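A complementary check, sketched under the assumption that `pipe`, `X_test`, and `y_test` from the script above are still in scope: permutation importance can also be computed on the whole pipeline against the original columns, which yields one score per raw column (sex, age, ...) instead of one per one-hot column.

```python
from sklearn.inspection import permutation_importance

# permute the original (untransformed) columns; the pipeline re-applies
# imputation and one-hot encoding to every shuffled copy
perm_raw = permutation_importance(
    pipe, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
for name, mean, std in sorted(
    zip(X_test.columns, perm_raw.importances_mean, perm_raw.importances_std),
    key=lambda t: t[1], reverse=True,
):
    print(f"{name:<12} {mean:.4f} ± {std:.4f}")
```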
As an ensemble method, the random forest in this example shows solid generalization and strong overall performance. Bagging and random feature subsampling counteract a single decision tree's sensitivity to small perturbations in the training data and lower the risk of overfitting. Compared with a single tree, a random forest typically improves test-set Accuracy, ROC-AUC, PR-AUC, and stability; a quick way to check this on the same split is sketched below. Here the OOB score (0.8216) also lands close to the test accuracy (0.8324), a good sign that the estimate of generalization performance is trustworthy.
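To put the single-tree comparison on an empirical footing, a quick baseline can reuse the same preprocessing. This is a sketch that assumes `preprocess`, the train/test split, and the metric imports from the script above are still available; the exact numbers will vary.

```python
from sklearn.tree import DecisionTreeClassifier

# identical preprocessing, but a single decision tree instead of the forest
tree_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])
tree_pipe.fit(X_train, y_train)
tree_proba = tree_pipe.predict_proba(X_test)[:, 1]
print("Single-tree Accuracy:", round(tree_pipe.score(X_test, y_test), 4))
print("Single-tree ROC-AUC:", round(roc_auc_score(y_test, tree_proba), 4))
```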
In practice, random forests are a good fit whenever you need a strong baseline quickly, for instance in credit risk, medical diagnosis, or customer-behavior prediction. They demand little preprocessing, handle mixed numerical and categorical features, and behave stably, which makes them a useful performance and robustness benchmark early in a project. If you later need higher accuracy or finer-grained interpretability, you can build on this baseline with gradient-boosted trees (XGBoost, LightGBM, CatBoost) or with model-explanation techniques (PDP, SHAP) to strike the right balance between performance and understanding.
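As a starting point for the interpretability direction, partial dependence plots can be drawn directly from the fitted pipeline with scikit-learn. A minimal sketch follows; the two features chosen are arbitrary, and matplotlib is assumed to be available.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# partial dependence of the predicted survival probability on two numeric features
PartialDependenceDisplay.from_estimator(pipe, X_test, features=["age", "fare"])
plt.tight_layout()
plt.show()
```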